Cloud computing has demonstrated that processing very large datasets over commodity
clusters can be done simply, given the right programming model and infrastructure. In 
this paper, we describe the design and implementation of the Sector storage cloud and 
the Sphere compute cloud. By contrast with the existing storage and compute clouds, 
Sector can manage data not only within a data centre, but also across geographically 
distributed data centres. Similarly, the Sphere compute cloud supports user-defined 
functions (UDFs) over data both within and across data centres. As a special case, 
MapReduce-style programming can be implemented in Sphere by using a Map UDF 
followed by a Reduce UDF. We describe some experimental studies comparing 
Sector/Sphere and Hadoop using the Terasort benchmark. In these studies, Sector is 
approximately twice as fast as Hadoop. Sector/Sphere is open source. 

1. Introduction

By a cloud, we mean an infrastructure that provides on-demand resources or services 
over the Internet, usually at the scale and reliability of a data centre. A storage cloud 
provides storage services (block or file-based services); a data cloud provides data 
management services (record-based, column-based or object-based services); and a
compute cloud provides computational services. Often these are stacked together to 
serve as a computing platform for developing cloud-based applications. 

Examples include Google's Google File System (GFS), BigTable and 
MapReduce infrastructure (Ghemawat et al. 2003; Dean & Ghemawat 2004; 
Chang et al. 2006); Amazon's S3 storage cloud, SimpleDB data cloud and EC2 
compute cloud (Amazon Web Services, http://aws.amazon.com/); and the open 
source Hadoop system (Hadoop, http://hadoop.apache.org/core), consisting of 
the Hadoop Distributed File System (HDFS), Hadoop's implementation 
of MapReduce, and HBase, an implementation of BigTable. 

The implicit assumption with most high-performance computing systems is
that the processors are the scarce resource, and hence shared. When processors 
become available, the data are moved to the processors. To simplify, this is the 
supercomputing model. An alternative approach is to store the data and to 
co-locate the computation with the data when possible. To simplify, this is the 
data centre model. 

Cloud computing platforms (GFS/MapReduce/BigTable and Hadoop) that 
have been developed thus far have been designed with two important 
restrictions. First, clouds have assumed that all the nodes in the cloud are 
co-located, i.e. within one data centre, or that there is relatively small bandwidth 
available between the geographically distributed clusters containing the data. 
Second, these clouds have assumed that individual inputs and outputs to the 
cloud are relatively small, although the aggregate data managed and processed 
are very large. This makes sense since most clouds to date have targeted Web 
applications in which large numbers of relatively small Web pages are collected 
and processed as inputs, and outputs consist of search queries that return 
relatively small lists of relevant pages. Although some e-Science applications 
have these characteristics, others must ingest relatively large datasets and 
process them. In addition, queries for certain e-Science applications also result in 
relatively large datasets being returned. 

By contrast, our assumption is that there are high-speed networks (10 Gb s⁻¹
or higher) connecting various geographically distributed clusters and 
that the cloud must support both the ingestion and the return of relatively 
large datasets. 

In this paper, we describe a storage cloud called Sector and a compute cloud
called Sphere, both of which we have developed. Both are available as open
source software from http://sector.sourceforge.net.

Sector is a distributed storage system that can be deployed over a wide 
area and allows users to ingest and download large datasets from any 
location with a high-speed network connection to the system. In addition, 
Sector automatically replicates files for better reliability, availability and
access throughput. Sector has been used to distribute Sloan
Digital Sky Survey (SDSS) data releases to astronomers around the world
(Gu et al. 2006).

Sphere is a compute service built on top of Sector that provides a set of simple
programming interfaces for users to write distributed data-intensive applications.
Sphere implements the stream-processing paradigm, which is usually used in
programming graphics processing units (GPUs; Owens et al. 2005) and multi-core
processors. The stream-processing paradigm can be used to implement any
application supported by MapReduce.

The rest of this paper will describe the details of Sector and Sphere in §§2 
and 3, respectively. Section 4 describes some experimental studies. Section 5 
describes related work and §6 is the summary and conclusion. 

This is an expanded version of a conference paper (Grossman & Gu 2008). This 
paper: (i) describes a later version of Sector that includes security, (ii) includes 
additional information about how Sector works, including how security and 
scheduling are designed, and (iii) describes new experimental studies. 

Figure 1. The Sector system architecture.



2. Sector 



(a) Overview 

Sector is a storage cloud as defined above. Specifically, Sector provides storage
services over the Internet with the scalability and reliability of a data centre. 
Sector makes three assumptions. 

(i) Sector assumes that it has access to a large number of commodity 
computers (which we sometimes call nodes). The nodes may be located 
either within or across data centres. 

(ii) Sector assumes that high-speed networks connect the various nodes in 
the system. For example, in the experimental studies described below, the 
nodes within a rack are connected by 1 Gb s⁻¹ networks, two racks within
a data centre are connected by 10 Gb s⁻¹ networks and two different data
centres are connected by 10 Gb s⁻¹ networks.

(iii) Sector assumes that the datasets it stores are divided into one or more
separate files, which are called Sector slices. The different files comprising
a dataset are replicated and distributed over the various nodes managed
by Sector. For example, one of the datasets managed by Sector in the
experimental studies described below is a 1.3 TB dataset consisting of
64 files, each approximately 20.3 GB in size.



Figure 1 shows the overall architecture of the Sector system. The security 
server maintains user accounts, user passwords and file access information. It 
also maintains lists of Internet Protocol (IP) addresses of the authorized
slave nodes, so that illicit computers cannot join the system or send messages to 
interrupt the system. 

The master server maintains the metadata of the files stored in the
system, controls the running of all slave nodes and responds to users' requests. 
The master server communicates with the security server to verify the slaves, the 
clients and the users. 

The slaves are the nodes that store the files managed by the system and
process the data upon the request of a Sector client. The slaves are usually 
running on racks of computers that are located in one or more data centres. 

(b) File system management

Sector is not a native file system; instead, it relies on the native file system on 
each slave node to store Sector slices. A critical element in the design of Sector is 
that each Sector slice is stored as one single file in the native file system. That is, 
Sector does not split Sector slices into smaller chunks. This design decision 
greatly simplifies the Sector system and provides several advantages. First, with 
this approach, Sector can recover all the metadata it requires by simply scanning 
the data directories on each slave. Second, a Sector user can connect to a single 
slave node to upload or download a file. By contrast, if a storage cloud manages 
data at the block level, then a user will generally need to connect to many slaves 
to access all the blocks in a file. The Hadoop system is an example of a storage
cloud that manages files at the block level (Hadoop, http://hadoop.apache.org/core).
Third, Sector can interoperate with native file systems if necessary.

A disadvantage of this approach is that it does require the user to break up large 
datasets into multiple files or to use a utility to accomplish this. Sector assumes 
that any user sophisticated enough to develop code for working with large datasets 
is sophisticated enough to split a large dataset into multiple files if required. 

The master maintains the metadata index required by Sector and supports 
file system queries, such as file lookup and directory services. The master 
also maintains the information about all slaves (e.g. available disk space) 
and the system topology, in order to choose slaves for better performance and 
resource usage. 

The current implementation assumes that Sector will be installed on a 
hierarchical topology, e.g. computer nodes on racks within multiple data centres. 
The topology is manually specified by a configuration file on the master server. 

The master checks the number of copies of each file periodically. If the number 
is below a threshold (the current default is 3), the master chooses a slave to make 
a new copy of the file. The new location of the file copy is based on the topology 
of the slaves' network. When a client requests a file, the master can choose a 
slave (that contains a copy of the file) that is close to the client and is not busy 
with other services. 
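
The periodic replication check can be sketched as follows. This is a minimal illustration under stated assumptions, not Sector's code: the `Slave` record, the three-level `distance` function and the policy of placing the new copy as far as possible from the existing copies are all hypothetical. The text states only that the threshold defaults to 3 and that the location is chosen from the topology.

```python
from dataclasses import dataclass

REPLICA_THRESHOLD = 3  # default number of copies stated in the text


@dataclass(frozen=True)
class Slave:
    name: str
    data_centre: str
    rack: str
    free_gb: int


def distance(a: Slave, b: Slave) -> int:
    """Hypothetical topology distance: 0 same rack, 1 same data centre, 2 otherwise."""
    if a.data_centre != b.data_centre:
        return 2
    return 0 if a.rack == b.rack else 1


def choose_new_replica(holders, slaves):
    """Return the slave that should receive a new copy, or None if none is needed."""
    if len(holders) >= REPLICA_THRESHOLD:
        return None
    candidates = [s for s in slaves if s not in holders]
    if not candidates:
        return None
    # Assumed policy: maximise the distance to the nearest existing copy,
    # breaking ties by free disk space.
    return max(
        candidates,
        key=lambda s: (min((distance(s, h) for h in holders), default=0), s.free_gb),
    )
```

A real master would also weigh how busy each slave is, as the text notes for read requests; that is omitted here.
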

The Sector client supports standard file access application programming
interfaces (APIs), such as open(), read() and write(). These APIs can be
wrapped to support other standards, such as the Simple API for Grid Applications.

(c) Security

Sector runs an independent security server. This design allows different
security service providers to be deployed (e.g. the Lightweight Directory Access
Protocol (LDAP) and Kerberos). In addition, multiple Sector masters (for better
reliability and availability) can use the same security service.

A client logs onto the master server via a secure sockets layer (SSL) 
connection. The user name and the password are sent to the master. The master 
then sets up an SSL connection to the security server and asks it to verify the
credentials of the client. The security server checks its user database and sends
the result back to the master, along with a unique session ID and the client's file
access privileges (for example, its I/O permissions for different directories).
In addition to the password, the client IP address is checked against an access
control list defined for the user. Both SSL connections require the use of public
certificates for verification.

If the client requests access to a file, the master will check whether the user has 
access privileges for that file. If granted, the master chooses a slave node to serve 
the client. The slave and the client then set up an exclusive data connection that 
is coordinated by the master. Currently, the data connection is not encrypted, 
but we expect to add encryption in a future release. 

Sector slave nodes only accept commands from the Sector master. Neither Sector 
clients nor other slave nodes can send commands directly to a slave. All client-slave 
and slave-slave data transfer must be coordinated by the master node. 

Finally, the security server controls whether a slave can be added to the 
system. The security server maintains an IP list and/or an IP range so that only 
computers on this list can join as slaves. 
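
The slave admission check can be sketched with Python's standard `ipaddress` module. The function name and the way the list and the ranges are passed in are assumptions for illustration; the text says only that the security server keeps an IP list and/or an IP range.

```python
import ipaddress


def is_authorized_slave(addr, allowed_ips, allowed_ranges):
    """Return True if addr is on the IP list or inside one of the IP ranges."""
    ip = ipaddress.ip_address(addr)
    # Exact matches against the explicit IP list.
    if any(ip == ipaddress.ip_address(a) for a in allowed_ips):
        return True
    # Membership in any configured range, written in CIDR notation.
    return any(ip in ipaddress.ip_network(r) for r in allowed_ranges)
```

For example, with `allowed_ips=["198.51.100.7"]` and `allowed_ranges=["192.0.2.0/24"]`, a join request from 192.0.2.15 would be accepted and one from 203.0.113.9 rejected.
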

(d) Message and data transfer 

Sector uses the User Datagram Protocol (UDP) for message passing and UDP-based
Data Transfer (UDT; Gu & Grossman 2007) for data transfer. UDP is faster than the
Transmission Control Protocol (TCP) for message passing because it does not
require connection set-up. We developed a reliable message passing library called
the Group Messaging Protocol to use in Sector. For data transfer, a Sector slave will
set up a UDT connection directly with the client. This UDT connection is set up 
using a rendezvous connection mode and is coordinated by the master. UDT is a 
high-performance data transfer protocol and significantly outperforms TCP over 
long-distance high-bandwidth links (Gu & Grossman 2007). 

A single UDP port is used for messaging and another single UDP port is used for 
all the data connections. A limited number of threads process the UDP packets, 
independently of the number of connections, which makes the communication
mechanism scale nicely as the number of nodes in the system increases. 



3. Sphere 

(a) Overview 

Recall that Sphere is a compute cloud that is layered over Sector. To introduce 
Sphere, consider the following example application. Assume we have 1 million
astronomical images of the Universe from the SDSS and the goal is to find brown 
dwarfs (stellar objects) in these images. Suppose the average size of an image is 
1 MB so that the total data size is 1 TB. The SDSS dataset is stored in 64 files,
named SDSS1.dat, ..., SDSS64.dat, each containing one or more images.

In order to access an image randomly in the dataset (consisting of 64 files), we
built an index file for each data file. The index file records the position
(i.e. offset and size) of each record (in this case, an image) in the data file. The
index files are named by adding an '.idx' suffix to the data file name:
SDSS1.dat.idx, ..., SDSS64.dat.idx.
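
To make the role of the index concrete, the following sketch builds an (offset, size) index for a list of records and uses it for random access. The binary layout (two little-endian 64-bit integers per record) and the function names are assumptions for illustration only; the paper does not specify the on-disk format of the .idx files.

```python
import struct

# Assumed index entry layout: (offset, size) as two little-endian uint64 values.
ENTRY = struct.Struct("<QQ")


def build_index(records):
    """Concatenate records into one data blob and return (data, index) as bytes."""
    data = bytearray()
    index = bytearray()
    for rec in records:
        index += ENTRY.pack(len(data), len(rec))  # offset is the current end of data
        data += rec
    return bytes(data), bytes(index)


def read_record(data, index, i):
    """Fetch record i without scanning the records that precede it."""
    offset, size = ENTRY.unpack_from(index, i * ENTRY.size)
    return data[offset:offset + size]
```

With such an index, a process can jump directly to the i-th image in a data file, which is what allows a data file to be split into segments on record boundaries.
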

To use Sphere, the user writes a function 'findBrownDwarf' to find brown
dwarfs from each image. In this function, the input is an image, while the output
indicates the brown dwarfs.

    findBrownDwarf(input, output);

A standard serial program might look similar to this:

    for each file F in (SDSS slices)
        for each image I in F
            findBrownDwarf(I, ...);

Using the Sphere client API, the corresponding pseudocode looks similar to this:

    SphereStream sdss;
    sdss.init(/* list of SDSS slices */);
    SphereProcess myproc;
    myproc.run(sdss, "findBrownDwarf");
    myproc.read(result);

In the pseudocode fragment above, 'sdss' is a Sector stream data structure that 
stores the metadata of the Sector slice files. The application can initialize the 
stream by giving it a list of file names. Sphere automatically retrieves the metadata 
from the Sector network. The last three lines will simply start the job and wait for 
the result using a small number of Sphere APIs. The users neither need to 
explicitly locate and move data, nor do they need to take care of message passing, 
scheduling and fault tolerance. 

(b) The computing paradigm 

As illustrated in the example above, Sphere uses a stream-processing computing 
paradigm. Stream processing is one of the most common ways in which GPU and 
multi-core processors are programmed. In Sphere, each slave processor is regarded 
as an arithmetic logic unit (ALU) in a GPU, or a processing core in a CPU. In the 
stream-processing paradigm, each element in the input data array is processed 
independently by the same processing function using multiple computing units. 
This paradigm is also called Single Program, Multiple Data, a term derived from 
Flynn's taxonomy of Single Instruction, Multiple Data (SIMD) for CPU design. 

We begin by explaining the key abstractions used in Sphere. Recall that a 
Sector dataset consists of one or more physical files. A stream is an abstraction in 
Sphere and it represents either a dataset or a part of a dataset. Sphere takes 
streams as inputs and produces streams as outputs. A Sphere stream consists of 
multiple data segments and the segments are processed by Sphere Processing 
Engines (SPEs) using slaves. An SPE can process a single data record from a 
segment, a group of data records or the complete segment. 

Figure 2 illustrates how Sphere processes the segments in a stream. Usually there
are many more segments than SPEs, which provides a simple mechanism for
load balancing, since a slow SPE simply processes fewer segments. Each SPE takes
a segment from a stream as an input and produces a segment of a stream as output.
These output segments can in turn be the input segments to another Sphere
process. For example, a sample function can be applied to the input stream and
the resulting sample can be processed by another Sphere process.

Figure 2. The computing paradigm of Sphere.
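
The chaining of Sphere processes described above can be imitated on a single machine. In the sketch below, a thread pool stands in for the SPEs and a list of lists stands in for a stream of segments; `sphere_process`, `sample` and `total` are hypothetical names, and nothing here uses the real Sphere API.

```python
from concurrent.futures import ThreadPoolExecutor


def sphere_process(stream, udf, num_spes=4):
    """Apply udf to every segment; the output stream keeps the input segment order."""
    with ThreadPoolExecutor(max_workers=num_spes) as pool:
        # Workers pull segments as they become free, so a slow worker
        # simply ends up processing fewer segments.
        return list(pool.map(udf, stream))


def sample(segment):
    """First-stage UDF: keep every second record of the segment."""
    return segment[::2]


def total(segment):
    """Second-stage UDF: reduce a segment to a single-record segment."""
    return [sum(segment)]


stream = [[1, 2, 3, 4], [5, 6], [7, 8, 9]]
sampled = sphere_process(stream, sample)   # output stream of stage 1
result = sphere_process(sampled, total)    # stage 1 output is stage 2 input
```
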

Figure 2 illustrates the basic model that Sphere supports. Sphere also supports 
some extensions of this model, which occur quite frequently. 

(i) Processing multiple input streams 

First, multiple input streams can be processed at the same time (for example, the 
operation A[ ] + B[ ] is supported). Note that this is not a straightforward extension,
because it can be complex to split input streams and to assign segments to SPEs. 

(ii) Shuffling input streams 

Second, the output can be sent to multiple locations, rather than being just 
written to local disk. Sometimes this is called shuffling. For example, a user-defined 
function (UDF) can specify a bucket ID (that refers to a destination file on either a 
local or a remote node) for each record in the output, and Sphere will send this 
record to the specified destination. At the destination, Sphere receives results from 
many SPEs and writes them into a file, in the same order that they arrive. It is in 
this way that Sphere supports MapReduce-style computations (Dean &
Ghemawat 2004).

Figure 3 shows an example that uses two Sphere processes (each process is 
called a stage) to implement distributed sorting. The first stage hashes the input 
data into multiple buckets. The hashing function scans the complete stream and 
places each element in a proper bucket. For example, if the data to be sorted are a 
collection of integers, the hashing function can place all data less than T_0 in bucket
B_0, data between T_0 and T_1 in bucket B_1, and so on.

Figure 3. Sorting large distributed datasets with Sphere.

In stage 2, each bucket (which is a data segment) is sorted by an SPE. Note 
that after stage 2, the entire dataset (stream) is now sorted. This is because all 
elements in a bucket are smaller than all the elements in any buckets further along 
in the stream. 

Note that in stage 2, the SPE sorts the whole data segment and does not just 
process each record individually. 
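
The two stages can be written down compactly for the integer example. This is a single-machine sketch in which lists stand in for segments and buckets; the threshold list and the function names are assumptions, and values equal to a threshold are placed in the higher bucket.

```python
import bisect


def stage1_hash(stream, thresholds):
    """Scan every segment and place each element in its range bucket."""
    buckets = [[] for _ in range(len(thresholds) + 1)]
    for segment in stream:
        for x in segment:
            # x < T_0 goes to bucket 0, T_0 <= x < T_1 to bucket 1, and so on.
            buckets[bisect.bisect_right(thresholds, x)].append(x)
    return buckets


def stage2_sort(buckets):
    """Sort each bucket as a whole; this is a per-segment, not per-record, operation."""
    return [sorted(b) for b in buckets]


stream = [[9, 1, 5], [7, 3, 8, 2]]
buckets = stage1_hash(stream, thresholds=[4, 8])
sorted_stream = stage2_sort(buckets)
```

Reading the sorted buckets in order yields the fully sorted dataset, because every element of a bucket is smaller than every element of the buckets after it.
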

(iii) SPE can process records or collections of records

This is the third extension to the basic model. In Sphere, an SPE can
process a single record, multiple records, the whole segment or a complete file 
at one time. 



(c) Sphere processing engine 

Once the master accepts the client's request for Sphere data processing, it sends 
a list of available slave nodes to the client. The client then chooses some or all the 
slaves and requests that an SPE starts on these nodes. The client then sets up a 
UDT connection (for both control and data) with the SPE. The stream-processing 
functions, in the form of dynamic libraries, are sent to each SPE and stored locally 
on the slave node. The SPE then opens the dynamic libraries and obtains the 
various processing functions. Then it runs in a loop that consists of the following 
four steps. 

First, the SPE accepts a new data segment from the client containing the file 
name, offset, number of rows to be processed and various additional parameters. 

Next, the SPE reads the data segment (and the corresponding portion of 
the .idx index file if it is available) from either the local disk or from another
slave node. 

As required, the stream-processing function processes either a single data
record, a group of data records or the entire segment, and writes the result to the 
proper destinations. In addition, the SPE periodically sends acknowledgements to
the client about the progress of the current processing.

When the data segment is completely processed, the SPE sends an
acknowledgement to the client to conclude the processing of the current data segment.

If there are no more data segments to be processed, the client closes the 
connection to the SPE, and the SPE is released. The SPE may also time out if
the client is interrupted. 

(d) Sphere client 

The Sphere client provides a set of APIs that developers can use to write 
distributed applications. Developers can use these APIs to initialize input 
streams, upload processing function libraries, start Sphere processes and read 
the processing results. 

The client splits the input stream into multiple data segments, so that each can 
be processed independently by an SPE. The SPE can either write the result to the 
local disk and return the status of the processing, or it can return the result of 
the processing itself. The client tracks the status of each segment (for example, 
whether the segment has been processed) and holds the results. 

The client is responsible for orchestrating the complete running of each Sphere 
process. One of the design principles of the Sector/Sphere system is to leave 
most of the decision making to the client, so that the Sector master can be quite 
simple. In Sphere, the client is responsible for the control and scheduling of the 
program execution. 

(e) Scheduler 

(i) Data segmentation and SPE initialization 

The client first locates the data files in the input stream from Sector. If the input 
stream is the output stream of a previous stage, then this information is already 
within the Sector stream structure and no further segmentation is needed. 

Both the total data size and the total number of records are calculated in order 
to split the data into segments. This is based on the metadata of the data files 
retrieved from Sector. 

The client tries to uniformly distribute the input stream to the available SPEs 
by calculating the average data size per SPE. However, in consideration of the 
physical memory available per SPE and the data communication overhead per 
transaction, Sphere limits the data segment size to between the boundaries S_min and
S_max (the default values are 8 MB and 128 MB, respectively, but user-defined
values are supported). In addition, the scheduler rounds the segment size to a 
whole number of records since a record cannot be split. The scheduler also requires 
that a data segment only contains records from a single data file. 

As a special case, the application may request that each data file be processed as 
a single segment. This would be the case, for example, if an existing application 
were designed to only process files. This is also the way the scheduler works when 
there is no record index associated with the data files. 
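
The segmentation rules above can be sketched as follows. The input format (a mapping from file name to the list of record sizes taken from its index) and the greedy packing are assumptions made for illustration. In this sketch the last segment of a file may be smaller than S_min, since a segment never spans two files.

```python
S_MIN = 8 * 2**20    # default lower bound on segment size (8 MB)
S_MAX = 128 * 2**20  # default upper bound on segment size (128 MB)


def segment_stream(files, num_spes, s_min=S_MIN, s_max=S_MAX):
    """Split files into segments (file name, first record, number of records).

    files maps a file name to the list of its record sizes in bytes.
    """
    total = sum(sum(sizes) for sizes in files.values())
    # Average data size per SPE, clamped to [s_min, s_max].
    target = min(max(total // num_spes, s_min), s_max)
    segments = []
    for name, sizes in files.items():
        start, acc = 0, 0
        for i, size in enumerate(sizes):
            # Close the segment on a record boundary before it exceeds the
            # target, but never emit an empty segment.
            if i > start and acc + size > target:
                segments.append((name, start, i - start))
                start, acc = i, 0
            acc += size
        if len(sizes) > start:
            segments.append((name, start, len(sizes) - start))
    return segments
```

For instance, with toy bounds `s_min=8` and `s_max=12`, two SPEs, a file `a.dat` of five 4-byte records and a file `b.dat` of two 6-byte records, the target size is 12 bytes and the result is three segments: records 0 to 2 and 3 to 4 of `a.dat`, and both records of `b.dat`.
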

(ii) SPE scheduling

Once the input stream is segmented, the client assigns each segment to an SPE. 
The following rules are applied: 

(i) each data segment is assigned to an SPE on the same node if there is one 
available, 

(ii) segments from the same file are processed at the same time unless following 
this rule leaves SPEs idle, and 

(iii) if there are still idle SPEs available after rule (i) and rule (ii) are applied, 
assign them parts of data segments to process in the same order as they 
occur in the input stream. 

The first rule tries to run an SPE on the same node on which the data reside 
(in other words, to exploit data locality). This reduces the network traffic and 
yields better throughput. The second rule improves data access concurrency, 
because SPEs can read data from multiple files independently at the same time. 

As mentioned in §3c, SPEs periodically provide feedback about the progress of 
the processing. If an SPE does not provide any feedback about the progress of the 
processing before a timeout occurs, then the client discards the SPE. The segment 
being processed by the discarded SPE is assigned to another SPE, if one is 
available, or placed back into the pool of unassigned segments. This is the 
mechanism that Sphere uses to provide fault tolerance. Sphere does not use any 
checkpointing in an SPE; when the processing of a data segment fails, it is
completely reprocessed by another SPE. 
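
The timeout mechanism can be sketched as a periodic sweep run by the client. All names here are hypothetical, and the sketch always returns the segment to the pool of unassigned segments rather than handing it directly to another SPE.

```python
def reap_timed_out(assignments, last_feedback, now, timeout, pending):
    """Discard SPEs that have been silent for longer than timeout.

    assignments: SPE id -> segment currently being processed.
    last_feedback: SPE id -> time of the most recent progress report.
    pending: pool of unassigned segments (modified in place).
    Returns the list of discarded SPE ids.
    """
    expired = [spe for spe, t in last_feedback.items() if now - t > timeout]
    for spe in expired:
        del last_feedback[spe]
        segment = assignments.pop(spe, None)
        if segment is not None:
            # No checkpointing: the whole segment is processed again.
            pending.append(segment)
    return expired
```
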

Fault tolerance is more complicated when SPEs write results to multiple 
destinations (as happens when using buckets for example). Each SPE dumps the 
result to a local disk before attempting to send the results to buckets on other nodes. 
In this way, if one node is down, the result can be sent to the same buckets on other 
nodes. Each bucket handler also records the status of incoming results from each data 
segment; thus, if one SPE is down, the bucket handler can continue to accept data in 
the correct order from another SPE that processes the same data segment again. 

If errors occur during the processing of a data segment due to problems with the 
input data or bugs in UDFs, the data segment will not be processed by any other 
SPE. Instead, an error report is sent back to the client, so that the application can 
take the appropriate action. 

In most cases, the number of data segments is significantly greater than the 
number of SPEs. For example, hundreds of machines might be used to process 
terabytes of data. As a consequence, the system is naturally load balanced, because 
all SPEs are kept busy during the majority of the runtime. Imbalances occur only 
towards the end of the computation when there are fewer and fewer data segments 
to process, causing some SPEs to be idle. 

Different SPEs can require different times to process data segments. There are 
several reasons for this, including: the slave nodes may not be dedicated; the slaves 
may have different hardware configurations (Sector systems can be heterogeneous);
and different data segments may require different processing times. Near the end of 
the computation, when there are idle SPEs but incomplete data segments, each idle 
SPE is assigned one of the incomplete segments. That is, the remaining segments are 
run on more than one SPE and the client collects results from whichever SPE 
finishes first. In this way, Sphere avoids waiting for the slow SPEs while the faster
ones are idle. After processing is complete, the Sphere client can reorder the 
segments to correspond to the original order in the input stream. 

(f) Comparison with MapReduce

Both the stream-processing framework used by Sphere and the MapReduce 
framework can be viewed as ways to simplify parallel programming. The approach of 
applying UDFs to segments managed by a storage cloud is more general than the 
MapReduce approach, in the sense that with Sphere it is easy to specify a map UDF 
and to follow it with a reduce UDF. We now describe how to do this in more detail. 

A MapReduce map process can be expressed directly by a Sphere process that 
writes the output stream to local storage. A MapReduce reduce process can be 
simulated by the hashing/bucket process of Sphere. In MapReduce, there is no 
data exchange between slave nodes in the map phase, while each reducer in the 
reduce phase reads data from all the slaves. In Sphere's version, the first stage 
hashes (key, value) pairs to buckets on other slave nodes, while in the second 
stage all data are processed locally at the slave by the reduce function. 

We illustrate this by showing how MapReduce and Sphere compute an 
inverted index for a collection of Web pages. Recall that the input is a collection 
of Web pages containing terms (words) and the output is a sorted list of pairs
(w, <pages>), where w is a word that occurs in the collection and <pages> is a
list of Web pages that contain the word w. The list is sorted on the first component.

Computing an inverted index using Sphere requires two stages. In the first stage, 
each Web page is read, the terms are extracted and each term is hashed into a 
different bucket. Sphere automatically assigns each bucket to a separate slave for 
processing. Think of this as the hashing or shuffling stage. To be more concrete, all 
words starting with the letter 'a' can be assigned to bucket 0, those beginning
with the letter 'b' to bucket 1, and so on. A more advanced hashing technique that
would distribute the words more evenly could also be used. In the second stage, each 
bucket is processed independently by the slave node, which generates a portion of the 
inverted index. The inverted index consists of multiple files managed by Sector. 

For example, assume that there are two Web pages (each is a separate file):
w1.html and w2.html. Assume that w1 contains the words bee and cow and that w2
contains the words bee and camel. In the first stage of Sphere, bucket 1 will
contain (bee, w1) and (bee, w2), and bucket 2 will contain (cow, w1) and (camel,
w2). In the second stage, each bucket is processed separately. Bucket 1 becomes
(bee, (w1, w2)) and bucket 2 remains unchanged. In this way, the inverted index is
computed and the result is stored in multiple files (bucket files).

In Hadoop's MapReduce (Hadoop, http://hadoop.apache.org/core), the map
phase would generate four intermediate files containing (bee, w1), (cow, w1),
(bee, w2) and (camel, w2). In the reduce phase, the reducer will merge the same
keys and generate three items (bee, (w1, w2)), (cow, w1) and (camel, w2).
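
The two-stage Sphere computation of the inverted index in this example can be sketched on one machine as follows. Dictionaries and lists stand in for bucket files, the first-letter bucket function is the simple scheme described in the text, and the function names are hypothetical.

```python
from collections import defaultdict


def bucket_id(word):
    """First-letter bucketing from the text: 'a' -> 0, 'b' -> 1, 'c' -> 2, ..."""
    return ord(word[0].lower()) - ord("a")


def stage1(pages):
    """Hashing stage: emit (word, page) pairs into buckets."""
    buckets = defaultdict(list)
    for page, text in pages.items():
        for word in text.split():
            buckets[bucket_id(word)].append((word, page))
    return buckets


def stage2(bucket):
    """Per-bucket stage: merge the pages for each word and sort by word."""
    index = defaultdict(list)
    for word, page in bucket:
        if page not in index[word]:
            index[word].append(page)
    return sorted(index.items())


pages = {"w1.html": "bee cow", "w2.html": "bee camel"}
buckets = stage1(pages)
index_b = stage2(buckets[1])  # words starting with 'b'
index_c = stage2(buckets[2])  # words starting with 'c'
```

Because each bucket holds a disjoint range of first letters, the per-bucket outputs can be read in bucket order to obtain the complete index sorted by word.
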

4. Experimental studies 

We have released Sector/Sphere as open source software and used it in a variety of
applications. We have also analysed its performance using the Terasort benchmark
(Govindaraju et al. 2006; Borthakur 2007). In this section, we describe two
Sector/Sphere applications and discuss their performance.

Figure 4. File downloading performance on the Teraflow Testbed.

(a) SDSS data distribution 

Sector is currently used to distribute the data products from the SDSS over the 
Teraflow Testbed (Gu et al. 2006; Teraflow Testbed, http://www.teraflowtestbed.net).
We set up multiple Sector servers on the Teraflow Testbed that we use to
store the SDSS data. We stored the 13 TB SDSS Data Release 5 (DR5), which 
contains 60 catalogue files, 64 catalogue files in EFG format and 257 raw image 
data collection files. We also stored the 14 TB SDSS Data Release 6 (DR6), which 
contains 60 catalogue files, 60 Segue files and 268 raw image collection files. The 
size of each of these files varies between 5 and 100 GB. 

We uploaded the SDSS files to several specific locations in order to better cover 
North America, Asia Pacific and Europe. We then set up a website (sdss.ncdm.uic.edu),
so that the users could easily obtain a Sector client application and the list of
SDSS files to download. The MD5 checksum for each file is also posted on the 
website, so that users can check the integrity of the files. 

The system has been online since July 2006. During the last 2 years, we have 
had approximately 6000 system accesses and a total of 250 TB of data
transferred to the end users. Approximately 80 per cent of the users are
interested only in the catalogue files, which range in size between 20
and 25 GB each.

Figure 4 shows the file downloading performance in an experiment of our own, 
where the clients are also connected to the Teraflow Testbed by 10 Gb s⁻¹ links.
In this experiment, the bottleneck is the disk I/O speed.

Figure 5 shows a distribution of the data transfer throughput of actual transfers 
to the end users during the last 18 months. In most of the SDSS downloads, the 
bottleneck is the network connecting the Teraflow Testbed to the end user and 
simply using multiple parallel downloads will not help. The SDSS downloads are 
currently distributed as follows: 31 per cent are from the USA; 37.5 per cent 
are from Europe; 18.8 per cent are from Asia; and 12.5 per cent are from Australia. 
The transfer throughput to users varies from 8 Mb s⁻¹ (to India) to 900 Mb s⁻¹ 
(to Pasadena, CA), all via public networks. More records can be found on 
sdss.ncdm.uic.edu/records.html. 

Figure 5. Performance of SDSS distribution to end users. 
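To give a sense of what the observed throughput range means in practice, the following sketch estimates the time needed to download a single file. It assumes decimal units (1 GB = 8000 Mb) and a constant throughput for the whole transfer; the function name is hypothetical.

```python
def transfer_seconds(size_gb, throughput_mbps):
    """Seconds needed to move size_gb gigabytes at throughput_mbps megabits/s.

    Assumes decimal units (1 GB = 8000 Mb) and a constant throughput.
    """
    if throughput_mbps <= 0:
        raise ValueError("throughput must be positive")
    return size_gb * 8000 / throughput_mbps


if __name__ == "__main__":
    # Slowest and fastest throughputs reported in the text, for a 20 GB file.
    for mbps in (8, 900):
        secs = transfer_seconds(20, mbps)
        print(f"{mbps} Mb/s: {secs:.0f} s ({secs / 3600:.2f} h)")
```

Under these assumptions, a 20 GB catalogue file takes about 20 000 s (roughly 5.6 h) at 8 Mb s⁻¹ and about 178 s at 900 Mb s⁻¹.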



(b) Terasort 

We implemented the Terasort benchmark to evaluate the performance of 
Sphere (Govindaraju et al. 2006; Sort Benchmark, 
http://www.hpl.hp.com/hosted/sortbenchmark/). With N nodes in the system, the 
benchmark generates a 10 GB file on each node and sorts the total N × 10 GB of data. Each 
record contains a 10-byte key and a 90-byte value. The Sphere implementation 
follows the bucket sorting algorithm depicted in figure 3. 
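The general two-stage bucket sort can be sketched as follows. This is a minimal single-process illustration, not the Sphere implementation: the choice of bucket by the first key byte, the function names and the in-memory lists are all assumptions made for the example.

```python
import os

KEY_LEN, VALUE_LEN = 10, 90          # record layout stated in the text
RECORD_LEN = KEY_LEN + VALUE_LEN


def bucket_of(record, num_buckets):
    """Map a record to a bucket using its first key byte (assumed scheme).

    Buckets are ordered: every key in bucket i is <= every key in bucket i+1.
    """
    return record[0] * num_buckets // 256


def bucket_sort(records, num_buckets=4):
    # Stage 1: scatter records into ordered buckets (one per node in a real run).
    buckets = [[] for _ in range(num_buckets)]
    for rec in records:
        buckets[bucket_of(rec, num_buckets)].append(rec)
    # Stage 2: sort each bucket independently by its 10-byte key.
    for b in buckets:
        b.sort(key=lambda rec: rec[:KEY_LEN])
    # Concatenating the buckets in order yields the globally sorted output.
    return [rec for b in buckets for rec in b]


if __name__ == "__main__":
    data = [os.urandom(RECORD_LEN) for _ in range(1000)]
    out = bucket_sort(data)
    assert [r[:KEY_LEN] for r in out] == sorted(r[:KEY_LEN] for r in data)
```

Because the buckets are ordered by key range, the second stage needs no communication between buckets, which is what makes the approach attractive when the buckets live on different nodes.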

The experimental studies summarized in table 1 were done using the Open Cloud 
Testbed (http://www.opencloudconsortium.org). Currently, the testbed consists 
of four racks. Each rack has 32 nodes, including 1 NFS server, 1 head node and 30 
compute/slave nodes. The head node is a Dell 1950, dual dual-core Xeon 3.0 GHz 
and 16 GB RAM. The compute nodes are Dell 1435s, single dual-core AMD 
Opteron 2.0 GHz, 4 GB RAM and 1 TB single disk. The four racks are located in 
JHU (Baltimore), StarLight (Chicago), UIC (Chicago) and Calit2 (San Diego). 

The nodes on each rack are connected by two Cisco 3750E switches, but only a 
1 Gb s⁻¹ connection is enabled at this time (a maximum of 2 Gb s⁻¹ can be 
enabled in/out each node). The bandwidth between racks is 10 Gb s⁻¹. The wide 
area links are provided by Cisco's C-Wave, which uses resources from the National 
Lambda Rail. Links from regional 10 GE research networks are used to connect 
the C-Wave to the racks in the testbed. 

Both Sector (http://sector.sourceforge.net) and Hadoop 
(http://hadoop.apache.org/core) are deployed over the 120-node (240-core) wide area system. 
The master server for Sector and the name node/job tracker of Hadoop are 
installed on one or more of the four head nodes. Both the Sphere client and the 
Hadoop client submit the job from a node in the system. This does not affect 
the performance since the traffic to/from the clients is negligible. 

Table 1 lists the performance for the Terasort benchmark for both Sphere and 
Hadoop. Times are in seconds and do not include the time to generate the data. Note 
that longer processing times are expected for more nodes, because the total 
amount of data increases proportionally. 

In this experiment, we sort 300 GB, 600 GB, 900 GB and 1.2 TB of data over 30, 60, 90 
and 120 nodes, respectively. For example, in the last case, the 1.2 TB of data are 
distributed on four racks located in four data centres across the USA. All 120 
nodes participated in the sorting process and essentially all of the 1.2 TB data are 
moved across the testbed during the sort. 

Table 1. The Terasort benchmark for Sector/Sphere and Hadoop. (All times are in seconds.) 

                                                            Sector/Sphere  Hadoop (three replicas)  Hadoop (one replica)
UIC (one location, 30 nodes)                                1265           2889                     2252
UIC + StarLight (two locations, 60 nodes)                   1361           2896                     2617
UIC + StarLight + Calit2 (three locations, 90 nodes)        1430           4341                     3069
UIC + StarLight + Calit2 + JHU (four locations, 120 nodes)  1526           6675                     3702

Both Sector and Hadoop replicate data for safety, and by default both generate 
three replicas. Their replication strategies are different though. Hadoop 
replicates data during the initial writing, while Sector checks periodically 
and, if there is not a sufficient number of replicas, creates them. For 
this reason, table 1 reports results for Hadoop with the replication factor set to 
1 (no replication) as well as with the default replication factor of 3. 
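The periodic strategy can be illustrated with a sketch of a single checking pass. This is not Sector's actual code: the data structures, the function name and the choice of target nodes (simply the first live nodes in sorted order) are assumptions made for illustration.

```python
def replication_pass(replica_map, live_nodes, target=3):
    """One periodic pass: return {file: [nodes to copy to]} for
    under-replicated files.

    replica_map maps a file name to the set of nodes that hold a copy.
    Copies on nodes that are no longer live are not counted.
    """
    plan = {}
    for name, holders in replica_map.items():
        alive = holders & live_nodes
        missing = target - len(alive)
        if missing > 0:
            # Hypothetical placement rule: first live nodes without a copy.
            candidates = sorted(live_nodes - alive)
            plan[name] = candidates[:missing]
    return plan
```

With a scheme of this kind, a write completes as soon as the first copy exists and the remaining copies are created later, which is why the single-replica Hadoop run is the closer comparison for the sort itself.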

The results show that Sphere is about twice as fast as Hadoop (when Hadoop's 
replication factor is set to 1). Moreover, Sphere scales better as the number of 
racks increases (1526/1265 = 1.2 for Sphere versus 3702/2252 = 1.6 for Hadoop). 
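The ratios quoted above can be recomputed directly from table 1. The short sketch below does so; the variable and function names are hypothetical.

```python
# Times in seconds, copied from table 1 (30, 60, 90 and 120 nodes).
sphere = [1265, 1361, 1430, 1526]
hadoop_r3 = [2889, 2896, 4341, 6675]
hadoop_r1 = [2252, 2617, 3069, 3702]


def speedup(baseline, other):
    """Per-configuration ratio baseline/other (> 1 means `other` is faster)."""
    return [b / o for b, o in zip(baseline, other)]


def slowdown(times):
    """Ratio of the four-rack time to the single-rack time."""
    return times[-1] / times[0]


if __name__ == "__main__":
    print([round(x, 2) for x in speedup(hadoop_r1, sphere)])
    print(round(slowdown(sphere), 2), round(slowdown(hadoop_r1), 2))
```

With these numbers, the ratio of Hadoop (one replica) to Sphere is roughly 1.8, 1.9, 2.1 and 2.4 for one to four racks, which is consistent with the summary of about twice as fast.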



5. Related work 

In this section, we describe some work related to Sector and Sphere. Sector provides 
some of the functionality of distributed file systems (DFSs), such as General 
Parallel File System, Lustre and Parallel Virtual File System (Kramer et al. 2004). 
DFSs provide the functionality of a file system on clusters of computers, sometimes, 
although rarely, over geographically distributed locations. While DFS may be 
suitable for a single organization with dedicated hardware and management, it is 
challenging to deploy and operate a regular DFS on a loosely coupled infrastructure 
consisting of commodity computers, such as those used for the experimental 
studies described here. 

On the other hand, the GFS (Ghemawat et al. 2003), the HDFS (Borthaku 2007) 
and Sector are special purpose file systems. They are particularly optimized for large 
files, for large scanning reads and for short random reads. By contrast with Sector, 
neither GFS nor HDFS was designed for nodes distributed over a wide area network. 

The Sector servers that are deployed over the Teraflow Testbed and used for 
distributing the SDSS data provide the functionality of a content distribution 
network, such as Akamai (Dilley et al. 2002). Akamai is designed to distribute 
large numbers of relatively small files and keeps a cache at most edge nodes of its 
network. In contrast to Akamai, Sector is designed to distribute relatively small 
numbers of large files and maintains copies at several, but not all, edge nodes. 

The stream-processing paradigm in Sphere is currently quite popular in the 
general-purpose GPU programming (GPGPU) community (Owens et al. 2005). 
The approach with GPGPU programming is to define a special 'kernel function' 
that is applied to each element in the input data by the GPU's vector computing 
units. This can be viewed as an example of an SIMD style of programming. Many 
GPU programming libraries and programming languages (e.g. Cg, sh and Brook) 
have been developed. Similar ideas have also been applied to multi-core processors, 
including the Cell processor. For example, specialized parallel sorting algorithms 
have been developed for both GPU processors (Govindaraju et al. 2006) and the 
Cell processor (Gedik et al. 2007). 

Table 2. A summary of some of the differences between Sector/Sphere and GFS/BigTable and Hadoop. 

design decision                 GFS, BigTable         Hadoop                Sector/Sphere
datasets divided into files or  blocks                blocks                files
  into blocks
protocol for message passing    TCP                   TCP                   group messaging protocol
  within the system
protocol for transferring data  TCP                   TCP                   UDP-based data transport
programming model               MapReduce             MapReduce             user-defined functions applied
                                                                              to segments
replication strategy            replicas created at   replicas created at   replicas created periodically
                                  the time of writing   the time of writing   by system
support high-volume inflows     no                    no                    yes, using UDT
  and outflows
security model                  not mentioned         none                  user-level and file-level access
                                                                              controls
language                        C++                   Java                  C++

Sphere uses the same basic idea, but extends this paradigm to wide-area 
distributed computing. Many of the GPGPU algorithms and applications can be 
adapted and run in a distributed fashion using Sphere. In fact, it is this analogy to 
GPGPU which inspired our work on Sphere. 

There are some important differences though: Sphere uses heterogeneous 
distributed computers connected by high-speed, wide-area networks instead of the 
identical ALUs integrated in a GPU; Sphere supports more flexible movement of 
data, but also requires load balancing and fault tolerance; finally, the bandwidth 
between Sphere's SPEs, although it may be up to 10 Gb s⁻¹, is not even close to 
the bandwidth within a GPU. Owing to these differences, Sphere runs a complete 
processing function or program on each SPE, rather than one instruction. 
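The idea of running a complete function on each segment of the data can be sketched as follows. The thread pool merely stands in for a set of SPEs, and the function names and the example UDF are hypothetical; Sphere's own API is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor


def process_segments(udf, segments, num_spes=4):
    """Apply a whole function `udf` to each data segment, using a pool of
    workers that stand in for SPEs. Results are returned in segment order."""
    with ThreadPoolExecutor(max_workers=num_spes) as pool:
        return list(pool.map(udf, segments))


def count_bright(segment, threshold=200):
    """Example UDF (hypothetical): count values above a threshold."""
    return sum(1 for v in segment if v > threshold)


if __name__ == "__main__":
    segments = [[1, 250, 300], [5], [201, 202]]
    print(process_segments(count_bright, segments))
```

Unlike a GPU kernel, which is applied element by element, the unit of work here is an entire segment, so each worker can run arbitrary code over its share of the data.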

One way of viewing GPGPU-style programming is as a parallel programming 
style that gains simplicity by restricting the type of application that it targets. 
By contrast, message-passing systems such as the Message Passing Interface (MPI; Gropp 
et al. 1999) are designed for very general classes of applications but are usually 
harder to program. Google's MapReduce (Dean & Ghemawat 2004) is one of the 
best-known examples of a system that targets a limited class of applications, 
but is relatively simple to program. As shown above, the Sphere system is similar 
to but more general than MapReduce: it provides a simple mechanism to execute 
UDFs over data managed by Sector, and it can implement MapReduce by using a 
map UDF followed by a reduce UDF. 

See table 2 for a summary of some of the differences between Sector/Sphere and 
other systems for cloud computing. 
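The composition of a map UDF followed by a reduce UDF, mentioned above, can be sketched as follows. This is a single-process illustration under stated assumptions, not Sphere's C++ API: the driver run_udf, the bucket-routing convention and the word-count example are all hypothetical.

```python
from collections import defaultdict


def run_udf(udf, segments):
    """Apply a user-defined function to each segment. Each call may emit
    (bucket_id, record) pairs, which are grouped into output buckets."""
    buckets = defaultdict(list)
    for seg in segments:
        for bucket_id, record in udf(seg):
            buckets[bucket_id].append(record)
    return buckets


def map_udf(segment, num_buckets=4):
    """Map stage: emit (word, 1), routed to a bucket by hashing the word."""
    for word in segment.split():
        yield hash(word) % num_buckets, (word, 1)


def reduce_udf(bucket):
    """Reduce stage: sum the counts per word within one bucket."""
    totals = defaultdict(int)
    for word, count in bucket:
        totals[word] += count
    for word, total in totals.items():
        yield 0, (word, total)


def word_count(segments):
    shuffled = run_udf(map_udf, segments)             # first UDF: map
    reduced = run_udf(reduce_udf, shuffled.values())  # second UDF: reduce
    return dict(reduced[0])
```

For example, word_count(["a b a", "b c"]) gives the counts a: 2, b: 2 and c: 1. Since all occurrences of a word hash to the same bucket, each reduce call sees every record for the words it handles.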

Sector/Sphere is similar to grid computing in that it aggregates distributed 
computing resources, but its approach is quite different. Traditional grid systems 
such as the Globus Toolkit (Foster 2005) and Condor (Thain et al. 2005) allow 
users to submit multiple tasks and to run these tasks in parallel. Grid job 
schedulers such as Swift (Zhao et al. 2007) and the Condor Directed Acyclic Graph 
Manager (http://www.cs.wisc.edu/condor/dagman) provide workflow services 
to support the scheduling of user tasks. Grid systems manage relationships 
among many tasks. By contrast, the Sphere client scheduler exploits data 
parallelism within one task. In this sense, grid computing is task oriented (multiple 
tasks processing one or more datasets), while Sphere is data oriented (single 
program processing a single dataset). 

In addition, a grid application submits a user's tasks to computing resources and 
moves data to these resources for computation, whereas Sector provides long-term 
persistent storage for data and Sphere is designed to start operations as close to the 
data as possible. In the Sphere-targeted scenarios, datasets are usually very large 
and moving them is expensive. To summarize, grid systems are 
designed to manage scarce specialized computing resources, while storage clouds, 
such as Sector, are designed to manage large datasets and compute clouds, such as 
Sphere, are designed to support computation over this data. 

Finally, note that Sphere is very different from systems that process streaming 
data such as GATES (Chen et al. 2004) and DataCutter (Beynon et al. 2000) or 
event stream-processing systems such as STREAM, Borealis, and TelegraphCQ 
(Babcock et al. 2002). While Sphere is designed to support large datasets, the data 
being processed are still treated as finite and static and are processed in a 
data-parallel model. By contrast, event stream-processing systems regard the input as 
infinite and process the data with a windowed model, sometimes with filters 
incorporating timing constraints in order to guarantee real-time processing. 

6. Conclusions 

For several years now, commodity clusters have been quite common. Over the next 
several years, wide-area, high-performance networks (10 Gb s⁻¹ and higher) will 
begin to connect these clusters. At the risk of oversimplifying, it is useful to think of 
high-performance computing today as an era in which cycles are the scarce resource, 
and (relatively small) datasets are scattered to large pools of nodes when their wait 
in the queue is over. By contrast, we are moving to an era in which there are large 
distributed datasets that must be persisted on disk for long periods of time, and 
high-performance computing must be accomplished in a manner that moves the 
data as little as possible, due to the costs incurred when transporting large datasets. 

Sector and Sphere are designed for these types of applications involving large, 
geographically distributed datasets in which the data can be naturally processed in 
parallel. Sector manages the large distributed datasets with high reliability, 
high-performance IO and uniform access. Sphere makes use of the Sector distributed 
storage system to simplify data access, to increase data IO bandwidth and to exploit 
wide-area, high-performance networks. Sphere presents a very simple programming 
interface by hiding data movement, load balancing and fault tolerance. 